Add `explicit_hydrogen` parameter by padix-key · Pull Request #741 · biotite-dev/biotite

padix-key · 2025-01-24T14:13:04Z

This parameter in interface.rdkit.to_mol() defines whether hydrogen atoms are included explicitly or implicitly in the created Mol.

codspeed-hq · 2025-01-24T14:27:25Z

CodSpeed Performance Report

Merging #741 will not alter performance

Comparing padix-key:rdkit (e424739) with main (a974bb8)

Summary

✅ 59 untouched benchmarks

Croydon-Brixton · 2025-01-27T14:28:14Z

src/biotite/interface/rdkit/mol.py

        if has_charge_annot:
            rdkit_atom.SetFormalCharge(atoms.charge[i].item())
+        if explicit_hydrogen:
+            rdkit_atom.SetNoImplicit(True)


In the base case where explicit_hydrogen=True, what would this mean if I just call to_mol for a typical crystal structure that won't have any hydrogens resolved? Will RDKit infer charges / valences in that case? If so, this might lead to broken molecules. If that were the case, should we check that hydrogens are present in the structure whenever explicit_hydrogen is true? (I could see this being something users will try; at least I would have tried it :D )

Charges were also not inferred automatically before this PR. But I am not sure what this means for valence: as the bond types are also set explicitly, I guess RDKit assumes a radical if explicit_hydrogen=True but hydrogen atoms are actually missing. I should probably test this.

Having a check there sounds like a good idea, especially as I agree that this could be a common mistake. However, I also think that strictly checking for the mere presence of hydrogen atoms might not be sensible, as there are valid molecules without hydrogen atoms, although they are rare. What do you think about raising a warning as a 'reminder' to the user to check the input?
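The radical hypothesis above can be checked directly in RDKit, independent of biotite. The following is a minimal sketch, not part of the PR: the helper name `build_cc` is hypothetical, and a bare C-C skeleton stands in for a structure without resolved hydrogen atoms.

```python
from rdkit import Chem


def build_cc(no_implicit):
    """Build a hydrogen-free C-C skeleton (hypothetical helper)."""
    mol = Chem.RWMol()
    for _ in range(2):
        atom = Chem.Atom("C")
        # Mirrors what the PR does for explicit_hydrogen=True
        atom.SetNoImplicit(no_implicit)
        mol.AddAtom(atom)
    mol.AddBond(0, 1, Chem.BondType.SINGLE)
    Chem.SanitizeMol(mol)
    return mol


explicit = build_cc(no_implicit=True)
implicit = build_cc(no_implicit=False)

# Open valences become radical electrons when implicit hydrogens are forbidden
radicals_explicit = [a.GetNumRadicalElectrons() for a in explicit.GetAtoms()]
# Otherwise RDKit fills the open valences with implicit hydrogens
radicals_implicit = [a.GetNumRadicalElectrons() for a in implicit.GetAtoms()]
hydrogens_implicit = [a.GetTotalNumHs() for a in implicit.GetAtoms()]
```

Comparing `radicals_explicit` with `radicals_implicit` shows whether the missing hydrogen atoms are interpreted as radicals or filled in implicitly.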

padix-key · 2025-02-02T10:09:20Z

I checked the different behaviors when an AtomArray without hydrogen is passed to to_mol() with explicit_hydrogen being False/True:

import biotite.interface.rdkit as rdkit_interface
import biotite.structure.info as info
import rdkit.Chem.AllChem as Chem
import rdkit.Chem.Draw as Draw

atoms = info.residue("C")
atoms = atoms[atoms.element != "H"]

mol = rdkit_interface.to_mol(atoms, explicit_hydrogen=True)
Chem.Compute2DCoords(mol)
Draw.MolToFile(mol, "explicit.png")

mol = rdkit_interface.to_mol(atoms, explicit_hydrogen=False)
Chem.Compute2DCoords(mol)
Draw.MolToFile(mol, "non_explicit.png")

explicit.png:

non_explicit.png:

So RDKit correctly understands that in the latter case it should add additional hydrogen atoms to the visualization, as the Mol already contains the hydrogen implicitly.

padix-key · 2025-02-07T09:09:08Z

I added a warning in case the structure contains no hydrogen atoms. @Croydon-Brixton Could you have a look again?

Croydon-Brixton · 2025-02-12T11:05:27Z

src/biotite/interface/rdkit/mol.py

    HXT
    """
-    mol = EditableMol(Mol())
+    hydrogen_mask = atoms.element == "H"


nit: I just remembered that I've seen the occasional structure with deuterium as well (e.g. 1wq2). How would we want to deal with these cases? Do we want to treat this isotope as hydrogen as well (probably chemically most appropriate) or not? If so, note that the PDB represents deuterium as the element "D".

This is indeed an important edge case! I agree that deuterium should be handled like hydrogen. However, this problem is not isolated to to_mol() but involves different places in the code base. Hence I added a separate issue (#758) and propose to handle this afterwards.
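One possible way to treat deuterium like hydrogen when building such a mask is sketched below. This is not the fix tracked in the separate issue: `HYDROGEN_SYMBOLS` and `is_hydrogen` are hypothetical names, and a plain list stands in for `atoms.element`.

```python
# Element symbols treated as hydrogen; "D" is how the PDB labels deuterium
HYDROGEN_SYMBOLS = frozenset({"H", "D"})


def is_hydrogen(elements):
    """Return one boolean per atom, True for hydrogen and deuterium."""
    return [element.upper() in HYDROGEN_SYMBOLS for element in elements]


elements = ["N", "C", "D", "O", "H"]
mask = is_hydrogen(elements)
heavy_atoms = [e for e, is_h in zip(elements, mask) if not is_h]
```

Keeping the accepted symbols in one place would make it easier to apply the same rule wherever the code base currently compares against "H" only.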

I also thought about it again, and I would prefer not to remove hydrogen atoms if explicit_hydrogen=False, but to raise an exception instead. This would ensure that the atom indices of the created Mol always correspond to the atom indices of the input AtomArray. This would matter, for example, when using indices obtained from Mol.GetSubstructMatch() to filter the AtomArray.
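To illustrate why index correspondence matters, here is a sketch with plain Python lists standing in for an AtomArray and for the tuple returned by `Mol.GetSubstructMatch()`. The function name `filter_by_match` and the example indices are hypothetical; the point is only that silently dropping hydrogen atoms shifts every index after the first removed atom.

```python
def filter_by_match(atom_names, match_indices):
    """Select atoms by indices obtained from the converted molecule."""
    return [atom_names[i] for i in match_indices]


# Glycine-like toy input, including two hydrogen atoms in the middle
atom_names = ["N", "H", "CA", "HA2", "C", "O"]
elements = ["N", "H", "C", "H", "C", "O"]

# If hydrogen atoms were removed during conversion, the molecule would
# only contain these atoms, in this order
heavy_names = [n for n, e in zip(atom_names, elements) if e != "H"]

# Hypothetical match for the carbonyl group, given as indices into the
# hydrogen-stripped molecule
match_in_stripped_mol = (2, 3)

# Correct only when applied to the stripped list ...
expected = filter_by_match(heavy_names, match_in_stripped_mol)
# ... but wrong when applied to the original input
shifted = filter_by_match(atom_names, match_in_stripped_mol)
```

Raising an exception instead of stripping atoms avoids this mismatch, because the caller has to remove the hydrogen atoms from the input first and therefore keeps both sides aligned.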

src/biotite/interface/rdkit/mol.py

Croydon-Brixton · 2025-02-12T11:10:26Z

src/biotite/interface/rdkit/mol.py

+    if explicit_hydrogen:
+        if not hydrogen_mask.any():
+            warnings.warn(
+                "No hydrogen found in the input, although 'explicit_hydrogen' is 'True'"


Suggested change

-                "No hydrogen found in the input, although 'explicit_hydrogen' is 'True'"
+                "No hydrogen found in the input, although 'explicit_hydrogen' is 'True'. "
+                "This may lead to atoms wrongly interpreted as radicals. Be careful."

Croydon-Brixton · 2025-02-12T11:48:23Z

src/biotite/interface/rdkit/mol.py

-        rdkit_atom = Atom(atoms.element[i].capitalize())
+        rdkit_atom = Chem.Atom(atoms.element[i].capitalize())
+        if explicit_hydrogen:
+            rdkit_atom.SetNoImplicit(True)


Hmm to my mind this option is still problematic, at least given the defaults of the to_mol function:
By default most PDBs won't have any hydrogens present, but the default in to_mol here is explicit_hydrogen=True. This will lead to radicals upon calling Chem.SanitizeMol, which is almost always undesirable and will probably catch a lot of users (especially those less well versed in RDKit) off guard.

I would suggest:

Changing the default to not expect explicit hydrogens.

Emitting a warning if a user says they're having explicit hydrogens, but no hydrogens are found, with a note that this can lead to radicals. After that, they're on their own ^^

Probably you are right, that the more common case is explicit_hydrogen=False 👍.

Croydon-Brixton · 2025-02-12T11:53:58Z

Sorry it wouldn't let me add this to the review summary earlier, but here's my analysis that exhibits the radicals issue:

Our base case (usual PDB files + default options)

The other cases:

Here's the code to reproduce:

import biotite.structure as struc
import biotite.structure.info  # ensure the `struc.info` subpackage is loaded
from biotite.interface.rdkit.mol import to_mol, from_mol
import rdkit.Chem.AllChem as Chem
from typing import List, Tuple, Union

try:
    # Settings for debugging & interactive tests
    from rdkit.Chem.Draw import IPythonConsole

    IPythonConsole.kekulizeStructures = False
    IPythonConsole.drawOptions.addAtomIndices = False
    IPythonConsole.ipython_3d = False
    IPythonConsole.ipython_useSVG = True
    IPythonConsole.drawOptions.addStereoAnnotation = True
    IPythonConsole.molSize = 300, 200
except ImportError:
    pass

tyr = struc.info.residue("TYR")
tyr.chain_id[:] = "A"
tyr.res_id[:] = 17
tyr_no_h = tyr[tyr.element != "H"]

def has_radicals(mol, return_indices: bool = False) -> Union[bool, Tuple[bool, List[int]]]:
    if mol is None:
        raise ValueError("Input molecule is None")
        
    radical_indices = []
    
    for atom in mol.GetAtoms():
        # Get number of radical electrons
        num_radical_electrons = atom.GetNumRadicalElectrons()
        if num_radical_electrons > 0:
            radical_indices.append(atom.GetIdx())
    
    has_any_radicals = len(radical_indices) > 0
    
    if return_indices:
        return has_any_radicals, radical_indices
    return has_any_radicals

def is_sanitized(mol) -> bool:
    if mol is None:
        return False
        
    try:
        # Create a copy to avoid modifying the input
        mol_copy = Chem.Mol(mol)
        
        # Try to sanitize with all checks enabled
        # catchErrors=True prevents exceptions from being raised
        result = Chem.SanitizeMol(mol_copy, catchErrors=True)
        
        # SANITIZE_NONE (0) means all checks passed
        return result == Chem.SanitizeFlags.SANITIZE_NONE
        
    except Exception:
        return False
    
rad_color = lambda has_rad: '\033[32m' if not has_rad else '\033[31m'  # Green if False, Red if True
san_color = lambda is_san: '\033[32m' if is_san else '\033[31m'      # Green if True, Red if False
reset_color = '\033[0m'

i = 0
for m in [tyr_no_h, tyr]:
    for explicit_hydrogen in [True, False]:
        mol = to_mol(m, explicit_hydrogen=explicit_hydrogen)
        Chem.Compute2DCoords(mol)
        print(f"------- {i}: has_H={'H' in m.element}, {explicit_hydrogen=} -------")
        
        # Color the output based on the conditions
        print("before sanitize")
        has_rad = has_radicals(mol)
        is_san = is_sanitized(mol)
        print(f"{rad_color(has_rad)}has_radicals(mol): {has_rad}{reset_color}")
        print(f"{san_color(is_san)}is_sanitized(mol): {is_san}{reset_color}")
        display(mol);

        Chem.SanitizeMol(mol)
        print("after sanitize")
        has_rad = has_radicals(mol)
        is_san = is_sanitized(mol)
        print(f"{rad_color(has_rad)}has_radicals(mol): {has_rad}{reset_color}")
        print(f"{san_color(is_san)}is_sanitized(mol): {is_san}{reset_color}")
        display(mol);

        print("converted back into biotite structure")
        print(repr(from_mol(mol, conformer_id=0)))

        i += 1

Croydon-Brixton · 2025-02-12T11:55:43Z

Not directly related to the radicals, but another thing I noted while looking at this:
The from_mol function returns an empty stack when no conformer is given for this test case (indeed there is no 3D conformer, but there is a valid molecule with bonded structure). I think we want users to be able to access the underlying RDKit molecule even in the absence of a conformer. In my work I needed this type of workflow, for example, to compute some chemical properties for molecules 'in the abstract'. I suggest just returning 'nan' coordinates for those cases (cf. my suggested edits in the PR below).
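The suggested fallback can be sketched without RDKit or biotite. Here nested lists stand in for coordinate arrays, and `coords_or_nan` is a hypothetical helper, not the function from the linked PR:

```python
import math


def coords_or_nan(conformers, n_atoms):
    """Return conformer coordinates, or one NaN-filled model if none exist."""
    if conformers:
        return conformers
    # No conformer: keep the atoms, but mark their positions as unknown
    return [[[math.nan] * 3 for _ in range(n_atoms)]]


# A molecule with three atoms and bonds, but no conformer
stack = coords_or_nan([], n_atoms=3)
n_models = len(stack)
n_atoms = len(stack[0])
all_nan = all(math.isnan(x) for atom in stack[0] for x in atom)
```

With this fallback the result always contains at least one model, so annotations and bonds stay accessible even though the coordinates carry no information.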

Croydon-Brixton · 2025-02-12T12:17:47Z

@padix-key I've added my suggested changes here for your convenience: padix-key#13

padix-key · 2025-02-15T18:24:29Z

Thanks @Croydon-Brixton for your support. Your suggested changes look basically good to me. I would have some minor suggestions, but it is probably less confusing, if I bring them up in this PR.

src/biotite/interface/rdkit/mol.py

padix-key · 2025-02-15T23:00:57Z

I mainly added three changes:

The default value of explicit_hydrogen is None, which infers the argument from the presence of hydrogen atoms.
I added the ability to select either 2D or 3D conformers. If no (matching) conformer is present, a model with NaN coordinates is returned.
I refactored the extra annotations a bit: previously, setting properties other than strings was not possible.
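The first point, combined with the warning discussed earlier in this thread, could look roughly like the sketch below. This is not the merged implementation: `resolve_explicit_hydrogen` is a hypothetical helper and a plain list stands in for `atoms.element`.

```python
import warnings


def resolve_explicit_hydrogen(elements, explicit_hydrogen=None):
    """Decide whether hydrogen atoms are treated as explicit."""
    has_hydrogen = any(element == "H" for element in elements)
    if explicit_hydrogen is None:
        # Default: infer the setting from the input itself
        return has_hydrogen
    if explicit_hydrogen and not has_hydrogen:
        warnings.warn(
            "No hydrogen found in the input, although 'explicit_hydrogen' "
            "is 'True'. This may lead to atoms wrongly interpreted as radicals."
        )
    return explicit_hydrogen


heavy_only = ["N", "C", "C", "O"]
with_hydrogen = ["N", "C", "C", "O", "H", "H"]

inferred_heavy = resolve_explicit_hydrogen(heavy_only)
inferred_full = resolve_explicit_hydrogen(with_hydrogen)

# Forcing explicit hydrogen on a hydrogen-free input triggers the warning
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    forced = resolve_explicit_hydrogen(heavy_only, explicit_hydrogen=True)
```

In this sketch the warning is only reachable when the caller overrides the inferred value, which matches the intent that the default should be safe for typical hydrogen-free PDB input.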


padix-key · 2025-02-24T13:10:17Z

@Croydon-Brixton Do you think this PR is ready to merge?

Croydon-Brixton

Oops, sorry the requested re-review somehow slipped through my fingers!

Yes, the changes look all good to me and from my point of view this is ready to merge. While reading over it again I just found one tiny typo.

I like the new default of 'None' for our explicit hydrogen policy. I think inferring makes a lot of sense and the edge cases where this isn't desired (e.g. some small molecule that truly has no hydrogens) should be harmless, since re-inferring should conclude that indeed no hydrogens are needed.
The selection of 2D & 3D conformers is very neat, thank you!!
Thank you for finding a way to keep the annotation dtypes preserved where possible!

src/biotite/interface/rdkit/mol.py

Croydon-Brixton · 2025-02-24T13:16:41Z

src/biotite/interface/rdkit/mol.py

-    conformer_id : int, optional
-        The conformer to be converted.
-        By default, an :class:`AtomArrayStack` with all conformers is returned.
+    conformer_id : int or {"2D", "3D"}, optional


Croydon-Brixton · 2025-02-24T13:18:33Z

tests/interface/test_rdkit.py

+    assert test_annot.dtype == ref_annot.dtype
+    assert test_annot.tolist() == ref_annot.tolist()


Great to see the annotation dtypes preserved! Well done for finding a way to do that :D
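A rough sketch of the type-casting idea, assuming the properties arrive as strings. `cast_annotation` is a hypothetical helper, and the order of attempted types is an assumption, not necessarily what the PR does:

```python
def cast_annotation(values):
    """Cast a list of strings to bool, int or float if all values allow it."""
    if all(v in ("True", "False") for v in values):
        return [v == "True" for v in values]
    for target_type in (int, float):
        try:
            return [target_type(v) for v in values]
        except ValueError:
            continue
    # Fall back to the original strings
    return list(values)


res_ids = cast_annotation(["17", "17", "18"])
b_factors = cast_annotation(["12.5", "13"])
flags = cast_annotation(["True", "False"])
names = cast_annotation(["CA", "CB"])
```

Trying the narrowest type first keeps integer annotations from being widened to floats, which is what a dtype comparison like the one in the test above would catch.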


padix-key · 2025-02-24T14:19:28Z

Thanks again! As the failing test will be fixed by #761, I will merge this PR.

* Add `explicit_hydrogen` parameter
* Use canonical rdkit import
* fix: enable returning molecules without conformers (with nan coords). Also set coordinates of non-3d conformers to nan even when explicitly requested, as they are meaningless.
* feat: attempt to type-cast extra annotations instead of always leaving them as 'str' objects. Also type-cast the rdkit internal 'isImplicit' flag.
* fix: suggestion for default case & warnings for explicit hydrogens
* Update src/biotite/interface/rdkit/mol.py

  Co-authored-by: Simon Mathis <simon.mathis@gmail.com>
* Infer `explicit_hydrogen` from presence of hydrogen atoms
* Add author
* Make 2D and 3D conformers selectable
* Allow multiple types for extra annotations
* Note that the atoms are in the same order
* Fix typo

  Co-authored-by: Simon Mathis <simon.mathis@gmail.com>

---------

Co-authored-by: Simon Mathis <simon.mathis@gmail.com>

padix-key temporarily deployed to publish January 24, 2025 14:34 — with GitHub Actions Inactive

Croydon-Brixton reviewed Jan 27, 2025


padix-key force-pushed the rdkit branch from ae9f239 to 4a04f4b Compare February 2, 2025 10:25

padix-key requested a review from Croydon-Brixton February 2, 2025 10:25

padix-key force-pushed the rdkit branch from 4a04f4b to 29565c4 Compare February 3, 2025 10:32

padix-key force-pushed the rdkit branch from 29565c4 to 167c31a Compare February 7, 2025 09:18

padix-key temporarily deployed to publish February 7, 2025 09:48 — with GitHub Actions Inactive

Croydon-Brixton requested changes Feb 12, 2025


Croydon-Brixton mentioned this pull request Feb 12, 2025

Fix/rdkit suggestions padix-key/biotite#13

Merged

padix-key mentioned this pull request Feb 14, 2025

RDKit Conversion Issue #756

Closed

padix-key commented Feb 15, 2025


padix-key force-pushed the interfaces branch from 4c7204a to 41a843f Compare February 20, 2025 09:41

padix-key force-pushed the rdkit branch from 7207f1b to dd4c1e3 Compare February 20, 2025 09:44

padix-key and others added 9 commits February 23, 2025 15:04

Add explicit_hydrogen parameter

dd00e32

Use canonical rdkit import

30b8eb3

fix: enable returning molecules without conformers (with nan coords).…

6ced9ea

… Also set coordinates of non-3d conformers to nan even when explicitly requested, as they are meaningless.

feat: attempt to type-cast extra annotations instead of always leavin…

c111e79

…g them as 'str' objects. Also type-cast the rdkit internal 'isImplicit' flag.

fix: suggestion for default case & warnings for explicit hydrogens

88cdfee

Update src/biotite/interface/rdkit/mol.py

c678317

Co-authored-by: Simon Mathis <simon.mathis@gmail.com>

Infer explicit_hydrogen from presence of hydrogen atoms

8235f01

Add author

81fa257

Make 2D and 3D conformers selectable

2964417

padix-key added 2 commits February 23, 2025 15:04

Allow multiple types for extra annotations

f30ca15

Note that the atoms are in the same order

aed49b5

padix-key force-pushed the rdkit branch from e51cebf to aed49b5 Compare February 23, 2025 14:04

Croydon-Brixton approved these changes Feb 24, 2025


Fix typo

e424739

Co-authored-by: Simon Mathis <simon.mathis@gmail.com>

padix-key merged commit 5c1e872 into biotite-dev:interfaces Feb 24, 2025
27 of 28 checks passed

padix-key deleted the rdkit branch February 24, 2025 14:32

